ABSTRACT

A major problem in achieving significant speed-up on parallel machines is the overhead
involved with synchronizing the concurrent processes. Removing the synchronization
constraint has the potential of speeding up the computation. We present asynchronous (AS)
and corrected-asynchronous (CA) finite difference schemes for the multi-dimensional heat
equation. Although our discussion concentrates on the Euler scheme for the solution of the
heat equation, it has the potential of being extended to other schemes and other parabolic
PDEs. These schemes are analyzed and implemented on the shared-memory multi-user
Sequent Balance machine. Numerical results for one- and two-dimensional problems are
presented. It is shown experimentally that the synchronization penalty can be about 50% of
run time: in most cases, the asynchronous scheme runs twice as fast as the parallel synchronous
scheme. In general, the efficiency of the parallel schemes increases with processor load, with
the time level, and with the problem dimension. The efficiency of the AS scheme may reach 90%
and over, but it provides accurate results only for steady-state values. The CA scheme, on the
other hand, is less efficient but provides more accurate results for intermediate
(non-steady-state) values.
1  Introduction


Although parallel machines are becoming more popular, most algorithms used on them
still rely heavily on synchronizing the concurrent processes. There is an inherent inefficiency
in the synchronization requirement, which is two-fold. First, a fast processor is delayed until
the slowest processor finishes. Thus, the pace of the algorithm is dictated by the slowest
processor. There are various reasons why certain processors will be ahead of the others, even
when they are physically configured at the same speed.

•  Random noise

Any unpredictable perturbation that can cause a momentary delay.

•  Multi-user  environment 

In such an environment many tasks are simultaneously assigned to each processor.
Since a single processor can be used for any of those tasks at any particular time, we
cannot predict how much processor time will be devoted to the computation of our algorithm.

•  Master/Slave 

In such an environment one processor, the master, performs tasks in addition to those
performed by the other processors, for example I/O operations, scheduling, or load
balancing tasks. Therefore, it may be slow in performing our algorithm.

•  Load  Balancing 

In many problems it may be logical to divide the problem unequally among the processors
due to different boundary conditions or approximation schemes that are used. In
general the partition of nodes among processors is dictated by numerical and physical
considerations rather than just by computer architecture considerations. This extends
the duration of an iteration for those processors that are assigned to do more work, and
therefore those would be slower than the others.

Second, there is a delay period associated with the synchronization mechanism itself,
whether it is setting the semaphores in a shared-memory environment, waiting for a message
to arrive in a message-passing environment, or any other possible implementation (e.g., when
possible, setting a time bound on the duration of an iteration, so that the next iteration
starts only after this time bound is exhausted). No matter what implementation is used,
the synchronization mechanism slows the progress of the entire computation. Moreover, in
some situations synchronization may cause contention over communication resources and
memory access, which requires careful implementation using additional mechanisms such as
locking.

A particular case where tasks are to be repeated is that of iterative computation. Numerical
properties of iterative solutions to PDEs are usually based on the assumption that the
iterations are synchronized. This is equivalent to assuming that the algorithm is governed
by a global clock so that the start of each iteration is simultaneous for all processors. In
implementing a synchronous algorithm on an inherently asynchronous architecture, a
synchronization mechanism is used in order to guarantee correct execution. This considerably
degrades the efficiency of those algorithms. An asynchronous algorithm can potentially
reduce the synchronization penalty since each processor can execute more iterations when it
is not constrained to wait for the most recent results of the computation in other processors.
In addition, asynchronous algorithms eliminate the programming effort involved in setting
up and debugging the synchronization mechanism and also simplify the task management.

The usage of asynchronous iterations for an iterative solution of systems of linear
equations is due to [()]. Asynchronous Iterative Methods for Multiprocessors are also discussed
in [1], and are used for the solution of ordinary differential equations by [11]. In this paper,
we present asynchronous iterative methods for the solution of partial differential
equations. These methods are based on Euler explicit finite difference schemes. In Section 2 we
present the parabolic PDE which will serve as our model problem, the corresponding finite
difference scheme and its asynchronous parallel modification. Section 3 analyzes our
asynchronous scheme and determines the conditions under which the asynchronous iterations
work. To compensate for the inaccuracy of non-steady-state values of this asynchronous
scheme, we propose and discuss in Section 4 the corrected-asynchronous scheme which, while
still being asynchronous, performs some extra extrapolation calculations. Numerical results
are presented in Section 5. Finally, our results are summarized in Section 6.


2  The  Problem  and  its  Asynchronous  Solution 


We demonstrate our approach to asynchronous solution on the multi-dimensional heat
equation. The same approach can be extended and generalized to other types of problems.
For our model problem we consider the simplest parabolic equation, the heat equation

(1)   $\frac{\partial u}{\partial t} = \sum_{s=1}^{d} a_s^2 \frac{\partial^2 u}{\partial x_s^2}$

in a rectangular domain, with accompanying initial and boundary conditions, where $a_s^2$
($1 \le s \le d$) are constant positive coefficients. For our model problem we consider Dirichlet
boundary conditions. However, other conditions are equally applicable.

Figure 1: Model problem - Domain discretization

A one-dimensional model problem is

$\frac{\partial u}{\partial t} = a_1^2 \frac{\partial^2 u}{\partial x^2}$   for $0 < x < 1$, $0 < t < T$

with initial condition

$u(x,0) = f(x)$,   $0 < x < 1$

and Dirichlet boundary conditions

$u(0,t) = g_1(t)$,   $0 < t < T$

$u(1,t) = g_2(t)$,   $0 < t < T$

where $a_1^2$ is a constant positive coefficient.

After discretizing the domain (see Fig. 1) we approximate Eq. (1) at grid point $(x_i, t_n)$
by the forward Euler finite difference approximation

(2)   $v_i^{n+1} = v_i^n + \sum_{s=1}^{d} r_s \, \delta_s^2 v_i^n$

where $r_s = a_s^2 \Delta t / (\Delta x_s)^2$, $1 \le s \le d$, are held as constants; $i$ ($1 \le i \le p$) denotes
the index of the spatial coordinate $x$ of a particular grid point; $d$ denotes the dimension of
the problem; and the operator $\delta_s^2$ denotes a central difference in the direction of $s$.

The one-dimensional version of this scheme is

$v_i^{n+1} = v_i^n + r \, (v_{i-1}^n - 2 v_i^n + v_{i+1}^n)$

where $r$ is held constant, and the values of $v_i^0$ ($1 \le i \le p$), $v_0^n$, and $v_{p+1}^n$ are
determined by the initial and boundary conditions, respectively.
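
As a minimal sketch of the one-dimensional update above (written in Python, which the paper does not use; the function name `euler_step` and the list-based grid layout are illustrative assumptions), one synchronous time level can be computed as follows:

```python
def euler_step(v, r):
    """Advance all interior points one time level with the explicit Euler scheme.

    v[0] and v[-1] hold the Dirichlet boundary values and are left unchanged.
    Every new value is computed from the old time level only, which is what
    makes this update synchronous.
    """
    new = list(v)
    for i in range(1, len(v) - 1):
        new[i] = v[i] + r * (v[i - 1] - 2.0 * v[i] + v[i + 1])
    return new
```

With r = 0.5 each interior value is replaced by the average of its two neighbours, so a linear profile such as `[1.0, 2.0, 3.0, 4.0]` is left unchanged by the step.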
It is well known that the scheme of Eq. (2) is stable if

(3)   $\sum_{s=1}^{d} r_s \le \frac{1}{2}$

Although the following discussion concentrates on the Euler scheme for the solution of the
heat equation, it has the potential of being extended to other schemes and other equations.
In this paper we only consider the simple scheme (2). Extensions to other explicit schemes
are straightforward. We emphasize in this study the analysis of asynchronous schemes.
Comparisons for ADI schemes will be performed in a work in preparation.


2.1  Asynchronous  (AS)  Model 

We now present the model of asynchronous iterations for the numerical problem described.
The definition of chaotic iteration was presented in [6]. The formal definition of asynchronous
iteration presented below is a modified version of what is discussed in [4], [7], and [11].


Definition 1 Let $F_i$ be a computational task to be done on the $i$-th component of a given
vector $v$, taking as its arguments components of $v$ in the neighborhood of $i$. Asynchronous
iterations $(F, v^0, J, \Delta)$ corresponding to these tasks and starting with a given vector $v^0$ are
a sequence of vectors $v^n$ defined recursively by:

(4)   $v_i^{n+1} = \begin{cases} F_i\big(v_1^{n+d(1,i,n)}, v_2^{n+d(2,i,n)}, \ldots, v_L^{n+d(L,i,n)}\big) & \text{if } i \in J_n \\ v_i^n & \text{if } i \notin J_n \end{cases}$

where $n$ is a global value of an iteration level, $v = (v_1, \ldots, v_L)$, $F = (F_1, \ldots, F_L)$,
$J = \{J_n \mid n = 1, 2, \ldots\}$ is a sequence of non-empty subsets of $\{1, \ldots, L\}$,
$\Delta = \{\Delta_1, \ldots, \Delta_L\}$, and $\Delta_i$ is a sequence of elements
$\Delta_i = \{(d(1,i,n), \ldots, d(L,i,n)) \mid n = 1, 2, \ldots\}$ $\forall i = 1, \ldots, L$.


No assumptions are made on the relations between the calculations of the grid points,
except a non-starvation condition which guarantees that the $i$-th grid point is updated an
infinite number of times, i.e. $i$ occurs infinitely often in the sets $J_n$. This means that no point
is abandoned forever, and consequently, no processor is held forever executing the same
iteration. Synchronous iterations are obtained from this model when $d(k,i,n) = 0$ $\forall k,i,n$.

A single or several grid points may be assigned to each processor of a MIMD machine with
$p$ asynchronous processors. For simplicity also assume that values stored by each processor
are available to the rest by means of a shared memory. Yet, with some minor modifications
this model is equally applicable to message-passing architectures, where in fact, the overhead
of synchronization is more significant.


Let us now take the local point of view and consider the $i$-th grid point which is assigned,
perhaps with some of its neighbors, to a specific processor. We present the following modified
asynchronous difference equation for the multi-dimensional heat equation:

(5)   $v_i^{n_i+1} = v_i^{n_i} + \sum_{s=1}^{d} r_s \, \tilde\delta_s^2 v_i^{n_i}$

Here, $n_i$ denotes the last completed iteration at the $i$-th grid point and $\tilde\delta_s^2 v_i^{n_i}$ is a central
difference of the value of $v_i$ at the $n_i$-th iteration, using the most recent available values of the
neighboring points. In the $s$-th coordinate direction the neighboring values are denoted by
$v_{i-1_s}^{n_i+\alpha_s}$ and $v_{i+1_s}^{n_i+\beta_s}$, respectively, so that

$\tilde\delta_s^2 v_i^{n_i} = v_{i-1_s}^{n_i+\alpha_s} - 2 v_i^{n_i} + v_{i+1_s}^{n_i+\beta_s}$

The values of $\alpha_s$ and $\beta_s$ correspond to the delay or advancement of the iteration number
of the neighboring grid point along the $s$-coordinate axis, relative to the $i$-th grid point, when
evaluating its $(n_i+1)$-th iteration.

For example, in one dimension we obtain

$v_i^{n_i+1} = v_i^{n_i} + r \, (v_{i-1}^{n_i+\alpha} - 2 v_i^{n_i} + v_{i+1}^{n_i+\beta})$

where $r = a_1^2 \Delta t / (\Delta x)^2$.

For our analysis we require $\alpha$ and $\beta$ to be bounded, i.e. no node falls infinitely behind
its neighbors. We note that for each iteration, $\alpha$ and $\beta$ are functions of $i$, the number of the
processor, and not of $x$. Hence as we add more intermediate points, the difference between
the adjacent iteration levels does not approach zero.
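
The asynchronous update can be imitated on a single processor. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the random choice of which points update in each tick, the `max_lag` rule that forces a lagging point to update (so that the delays stay bounded), and the boundary data u(0) = 0, u(1) = 1 are all hypothetical choices made for the example. Each point keeps its own iteration counter and reads whatever values its neighbours currently hold.

```python
import random


def async_relax(p=8, r=0.4, ticks=5000, max_lag=3, seed=0):
    """Sequentially simulate the one-dimensional asynchronous scheme.

    Returns the grid values and the per-point iteration counters.
    """
    rng = random.Random(seed)
    v = [0.0] * (p + 2)
    v[-1] = 1.0              # Dirichlet data: u(0) = 0, u(1) = 1
    n = [0] * (p + 2)        # iteration counter of each grid point
    for _ in range(ticks):
        newest = max(n[1:-1])
        for i in range(1, p + 1):
            lagging = n[i] < newest - max_lag   # keeps the delays bounded
            if lagging or rng.random() < 0.5:
                # Neighbours are read at whatever iteration level they hold now.
                v[i] = v[i] + r * (v[i - 1] - 2.0 * v[i] + v[i + 1])
                n[i] += 1
    return v, n
```

For this steady-state problem the values approach the linear profile i/(p+1) even though the points advance at different rates, which is the behaviour the analysis in the next section addresses.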


3  Analysis  of  the  Asynchronous  Scheme 

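Lemma 1 When $\alpha$ and $\beta$ are bounded, the asynchronous finite difference approximation Eq.
(5) is consistent with the following heat equation:

(6)   $\frac{\partial u}{\partial t} = K(x,t) \, \nabla \cdot [\nabla u]$

where $K(x,t) = \frac{1}{1 - \sum_{s=1}^{d} r_s (\alpha_s + \beta_s)}$, and $r_s = a_s^2 \Delta t / (\Delta x_s)^2$ are constants.


Proof: Taylor series expansion of Eq. (5) yields

(7)   $v_i^{n_i+1} = \underbrace{v_i^{n_i} + \sum_{s=1}^{d} r_s \, \delta_s^2 v_i^{n_i}}_{\text{original scheme}} + \underbrace{\sum_{s=1}^{d} r_s (\alpha_s + \beta_s)(v_i^{n_i+1} - v_i^{n_i})}_{\text{perturbation}} + O[(\Delta t)^2 + \sum_{s=1}^{d} (\Delta t)(\Delta x_s)]$

Thus,

$\Big[1 - \sum_{s=1}^{d} r_s (\alpha_s + \beta_s)\Big] \frac{v_i^{n_i+1} - v_i^{n_i}}{\Delta t} = \sum_{s=1}^{d} a_s^2 \frac{\delta_s^2 v_i^{n_i}}{(\Delta x_s)^2}$

approximates Eq. (6) with truncation error of $O[(\Delta t) + \sum_s (\Delta x_s)]$.

The coefficient $K$ in Eq. (6) depends on $\alpha_s$ and $\beta_s$ and thus can become negative. It
is well known that the multi-dimensional heat equation Eq. (6) is well-posed for a positive
coefficient, $K > 0$, and can be ill-posed for a negative coefficient. Thus, a sufficient condition
for well-posedness is $1 - \sum_s r_s (\alpha_s + \beta_s) > 0$. Consequently, for the special case where
$r_s = r$ $\forall s$ and $r \le \frac{1}{2d}$ we obtain $\sum_s (\alpha_s + \beta_s) < 2d$. However, this leads to a severe
restriction on $\alpha$ and $\beta$, so the scheme is not completely asynchronous. A weaker condition
for the well-posedness of Eq. (6) can be set, requiring positiveness only in some average sense.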
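
The pointwise sufficient condition is easy to evaluate for given delays. The following sketch (the function name `wellposedness_margin` is hypothetical) returns the quantity 1 minus the sum over s of r_s(α_s + β_s), which must be positive for the condition to hold:

```python
def wellposedness_margin(r, alpha, beta):
    """Return 1 - sum_s r_s * (alpha_s + beta_s) for one grid point.

    r, alpha and beta are sequences with one entry per space dimension.
    A positive result means the pointwise sufficient condition is met.
    """
    return 1.0 - sum(rs * (a + b) for rs, a, b in zip(r, alpha, beta))
```

For d = 2 and r_s = 1/4, delays of +1 on all four neighbours give a margin of exactly zero (the sum of the delays equals 2d), so the condition fails, while one neighbour at zero delay restores a positive margin.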

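Lemma 2 A sufficient condition for the well-posedness of Eq. (6) for large times when

$K_1(t) \le K(x,t) \le K_2(t)$   $\forall x$

is

$\int_0^t K_1(\xi) \, d\xi > 0$   for $t$ large

Proof: Assume $K(x,t)$ satisfies $K_1(t) \le K(x,t) \le K_2(t)$ $\forall x$. Using separation of
variables we obtain a general solution to the equation $v_t = K_1(t) \, \nabla \cdot [\nabla v]$:

$v(x,t) = \sum_n A_n e^{-\lambda_n P(t)} \varphi_n(x)$

where $P(t) = \int_0^t K_1(\xi) \, d\xi$ and $\nabla \cdot [\nabla \varphi_n] + \lambda_n \varphi_n = 0$. $\varphi_n(x)$ are the appropriate
eigenfunctions of the steady-state equation with eigenvalue $\lambda_n > 0$. Accordingly,
$w_t = K_2(t) \, \nabla \cdot [\nabla w]$ has a solution of the same form, with $K_2$ in place of $K_1$.

For the one-dimensional case we obtain $\lambda_n = n^2 \pi^2 > 0$; $\varphi_n(x) = \sin(n \pi x)$, and for two
dimensions, $\lambda_{nm} = (n^2 + m^2) \pi^2 > 0$; $\varphi_{nm}(x,y) = \sin(n \pi x) \sin(m \pi y)$.

For large times only the first mode $\varphi_1(x)$ is important. In this case the equation is
reducible to an ordinary differential equation, and hence $u(x,t)$ is bounded from below and
above by the corresponding solutions.

Thus, sufficient conditions for $u$ to converge to a steady state are:

1.  $P(t) > -\delta$ $\forall t$ for some $\delta$

2.  $A_n \sim$ [illegible] as $n \to \infty$

3.  $\lim_{t \to \infty} P(t) = \infty$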
Figure 2: Well-posedness


A possible way of achieving convergence is restricting $K > 0$ at the beginning. This ensures
that $P(t)$ is initially positive (see Fig. 2). If cond. 2 is not true and $P(t)$ is negative for
some $t$, then

$\nabla \cdot [\nabla u] =$ [illegible]

may not converge. To enforce cond. 2 we can start synchronously and only later let the
processors run asynchronously. Another possibility in case $P(t)$ becomes negative, $P(t) >
-\delta$ $\forall t$, is to filter the initial conditions, to damp the high frequencies

$u(x,0) = f(x) -$ [illegible]

and $A_n$ satisfies cond. 2, e.g. $A_n =$ [illegible] for $n > N$. |

Lemma 3 The asynchronous finite difference scheme Eq. (5) is stable for

$\sum_{s=1}^{d} r_s \le \frac{1}{2}$

Proof: Assume that the $n_i$-th iteration of the $i$-th grid point has been completed and its next
iteration is currently being performed (see Fig. 3). Since (3) is valid, the absolute
values of the coefficients of the $v$ values on the right-hand side of Eq. (5) sum to 1. Hence

$|v_i^{n_i+1}| \le \max\{ |v_i^{n_i}|, |v_{i-1_s}^{n_i+\alpha_s}|, |v_{i+1_s}^{n_i+\beta_s}| : 1 \le s \le d \}$

Using the last inequality recursively by backtracking the origins of each relevant grid point,
we finally reach the initial values, thus

$|v_i^{n_i+1}| \le \max_j |v_j^0|$   |

By lemma 1, lemma 3, and the Lax Equivalence Theorem the scheme is convergent in
the maximum norm. (We also note that this proof applies to any scheme for which all the
coefficients are between 0 and 1. For parabolic equations, higher order schemes can
be constructed with this property.) Hence,


Figure  3:  Stability  analysis  -  processors  status 


Lemma 4 The asynchronous finite difference scheme is convergent to Eq. (6), under the
same restrictions as Lemma 3.

This approach for proving the convergence of the AS scheme was separately derived and
used by the third author in the paper [1], and it was presented in [10].

Assume now that the solution of the asynchronous scheme in Eq. (5) has been completely
evaluated up to a time level $T = N \Delta t$, and that it is now interpolated into a smooth function
$w(x,t)$ over the entire region under discussion. We denote by $v$ the discrete solution of the
synchronous scheme of Eq. (2) and assume that both $w$ and $v$ satisfy the required
boundary and initial conditions. Let $e_i^n$ be the difference between them at an interior grid
point $P_i$ for a global value of $n$:

(8)   $e_i^n = w_i^n - v_i^n$

and let

$e^n = (e_1^n, \ldots, e_L^n)^T$

be the corresponding difference vector, in some specific order, associated with all interior
spatial grid points $P_i$ ($1 \le i \le L$) at the same global time level $t = n \Delta t$ ($n \le N$). In
addition, let

(9)

where $\alpha_s^n$ and $\beta_s^n$ are the corresponding delays or advancements $d(k,i,n)$ in Eq. (4) of the
neighbors of the $i$-th grid point along the $s$-coordinate axis, at the time its value was evaluated.

Then we can prove the following lemma which bounds $e^n$, given a specific grid spacing.
Lemma 5 If $\alpha$ and $\beta$ are bounded by $M$ and

(10)   [illegible]

then for $n \Delta t \le T$

(11)   $\|e^{n+1}\| \le M \sum_{k=0}^{n} \max_i |v_i^{k+1} - v_i^k| + T \cdot O[(\Delta t) + \sum_s (\Delta x_s)]$

Remark 1 The last inequality should be read as follows: there exists a constant $R$ (which
is, in fact, the bound on the second-order derivative of $w(x,t)$) such that

$\|e^{n+1}\| \le M \sum_{k=0}^{n} \max_i |v_i^{k+1} - v_i^k| + R \cdot T \, [(\Delta t) + \sum_s (\Delta x_s)]$

for all sufficiently small $\Delta t$ and $\Delta x_s$ ($1 \le s \le d$). Note that $R$ depends on $w(x,t)$, which in
turn depends on the mesh spacing.


Proof: Using the Taylor Theorem of the Mean and neglecting terms of $O[(\Delta t)^2 + \sum_s (\Delta x_s)(\Delta t)]$
we obtain from Eqs. (5) and (8),


dr. 


(12) 


c*'  =i:y, 

+lm  2i;y,,-, +  e.!le' 


in  which  c,Ai,  and  denote  the  difference  values  at  the  neighi>ors  of  the  grid  point 

along  the  a-coordinate  axis.  In  a  matrix  form  one  obtains. 


(13) 


^n+l)  _ 


In  Eq.  (13)  is  a  (2ri  +  l)-diagonal  matrix  with  main  diagonal  terms  of  the  form 

I  + j:i) 

and  d  additional  diagonals  on  each  side  with  terms  of  the  form 

1  <s  <  d 


E;=,r.(o"  + 


while  if’d  is  a  vector  whose  components  are  giv-  n  Iw 

xy,  --..(o"  +  XU 


fh 


-  E.=  ,  ^  O'J 


(< 


1  ,  n  > 


(11) 


Following is an example of a 9 x 9 matrix, corresponding in some specific order to a 3 x 3
interior grid, for the two-dimensional problem:


l-t-2 

\  —L  I  1  —  i/;j 


in  whicli  /- 


1  -  lO 

1  —  1  — 1.'4 

rp  I 

ry 

_ Q_ 

l~l^4 

'•i . 

1  1  —  7-'* 

i-l'z 

l-i'5 

^'2 

^-<'6 

I  —  Ixf, 

1  —L'e 

'  -1^6 

_Z1_. 

ir±L 

r  ^ 

1-C7 

1-i/T 

r- 

To 

- 

1  - 1'8 

i-i/g 

1  —  1/8 

-  -  «^8 

ri 

-  ^2^ 

1-^9 

1  -t/g 

1  —  t/g 

2(r]  +  r2)  +  n, 

,  and 

.^'9  - 


-71 

^32 
■^  3  3  / 


where $e_{jk}^n$ is the error of the grid point at the $j$-th row and the $k$-th column at the $n$-th time level.

It follows from Eq. (10) that all the elements of $A^n$ are positive and that $\|A^n\|_\infty = 1$.

Since $e^0 = 0$, we obtain from Eq. (13) that

(in 


In  addition,  — ^ - =  - so  that, 

I  E..=l  4-  A  J _ ,  _ '}i  . /  n  ,:^n  o  ^  O^p/'v" 

‘  1  -  L.U  --.(a".  +  arS  -  Ef-_,  ..'s’'--'-''  -  " 

Then (11) follows from the last inequality, from Eq. (10) and from Eq. (15). |

Note from Eq. (12) that in the steady-state case, when $v_i^{n+1} - v_i^n \to 0$, $\|e^n\|$ does not
increase as $n \to \infty$, neglecting terms of $O[(\Delta t)^2 + \sum_s (\Delta x_s)(\Delta t)]$.


4  Corrected- Asynchronous  (CA)  Scheme 


We have shown that the asynchronous iterations are consistent with a PDE that is
different from the original one. Consequently, time accuracy is lost. This suggests that if we
apply a correction during each iteration we can improve our approximation, without requiring
explicit synchronization. For each processor we now require an extra variable that will serve
as an iteration counter for that processor. The correction we apply is an extrapolation based
on those variables.

Hence, we construct a modified equation whose asynchronous approximation is time
consistent with the original Eq. (1) by subtracting the perturbation term of Eq. (7),


?■, -El  n,  I  t2  fi,  {  + 1  I  \ 

-  (»,  -  ‘b 

where $\tilde\delta_s^2$ denotes a central difference in the space coordinates, with each point at its own
completed time level. Thus, our Corrected-Asynchronous (CA) scheme is:


,  -f-  1  '<1 

=11. 


provided  t  lia ; . 

(IT) 


Lemma 6 If $\alpha_s$ and $\beta_s$ are bounded, then the Corrected-Asynchronous approximation
Eq. (16) is consistent with Eq. (1).

Proof: Using the Taylor Theorem of the Mean we have,

+(0';;  +  -//''■) +  b>(A.r.)(A/)  +  (;(A<^) 

Substituting in Eq. (16) we obtain,


‘  =  C'  +  Lbsf'f'b'"  +  LO(A,r,)0(A/)  +  0{At^) 


b,-fl  ,,n.,  <l  v2,/U  'I 


II ,  —  U 


.5  —  1  A  '  •'*  /  s=\ 


Hence the scheme of Eq. (16) is consistent with Eq. (1). |


Lemma  7  If 

(18) 


E.s>b  ^  1 


then the CA scheme of (16) is stable.


Proof:  Analogous  to  the  proof  of  lemma  3.  | 

By  the  Lax  Equivalence  Theorem,  lemma  6  and  lemma  7  imply  that  the  CA  scheme  is 
convergent,  provided  that  condition  (18)  is  valid. 

Condition (18) can be interpreted either as restrictions on $\alpha$ and $\beta$, or as a bound on $r_s$.
Practically, the bounds on $\alpha$ and $\beta$ are determined by various factors as discussed earlier.
Consequently, they impose the above restriction on $r_s$.

Finally, similar to lemma 5, we now show that at a given point of a specific mesh spacing,
the difference between the solutions of Eq. (16) and Eq. (2) is bounded by $O[(\Delta t) + \sum_s (\Delta x_s)]$,
provided that condition (3) is valid.


Lemma 8 Let

$e_i^n = w_i^n - v_i^n$

where $w_i^n$ is the smooth interpolation of the solution of the CA scheme, Eq. (16), and $v_i^n$ is the
solution of the synchronous scheme Eq. (2), both under the same initial/boundary conditions.
In addition, let

$e^n = (e_1^n, \ldots, e_L^n)^T$

be the corresponding difference vector, in some specific order, with all interior spatial grid
points $P_i$ ($1 \le i \le L$) at the same global time $t = n \Delta t$ ($n \le N$). If (3) is valid then for
$n \Delta t \le T$

(19)   $\|e^n\|_\infty \le T \cdot O[(\Delta t) + \sum_s (\Delta x_s)]$

provided that (17) holds.

Proof: Along the same lines of the proof of lemma 5 we can show from Eq. (7) that $w_i^n$
satisfies Eq. (2) with an error of $O[(\Delta t)^2 + \sum_s (\Delta x_s)(\Delta t)]$. Hence, if (3) is valid then

$\|e^{n+1}\|_\infty \le \|e^n\|_\infty + O[(\Delta t)^2 + \sum_s (\Delta x_s)(\Delta t)]$

Since $\|e^0\|_\infty = 0$, by using the inequality recursively, we obtain (19). |


5  Numerical  Results 


We have implemented the above algorithms on the shared-memory multi-user Sequent
Balance [15]. The Sequent systems are commercial multiprocessors that incorporate identical
general-purpose 32-bit microprocessors and a single common memory. Each processor can
execute both user and kernel code. All processors share a single pool of memory, in addition
to their own local memory, to enhance resource sharing and communication among different
processes. Processors, memory modules, and I/O controllers are all plugged into a high-speed
bus. Thus, the Sequent systems fit our model of an asynchronous multiprocessor with the
ability to exchange data between its processors via shared memory.

The Sequent systems run the DYNIX operating system, a version of UNIX that supports
a multi-user environment. Therefore the overhead of a single operation cannot be predicted.
For example, task creation on another processor involves allocation of a free processor,
allocation of memory and free entries in system tables, re-mapping, and finally context
switching. The duration of allocations of both processors and memory is unpredictable
as it is affected by numerous factors, although context switching itself takes only a few
hundred machine cycles. As such, the Sequent is an example of a machine where no a priori
assumptions can be made regarding relative speeds of processing.

We investigated a serial (SR) implementation of Eq. (2), a parallel synchronous (SY) version
of Eq. (2), an asynchronous (AS) implementation of Eq. (5), and a corrected-asynchronous
(CA) implementation of Eq. (16).

In our synchronous (SY) version we delay all processors until the last one finishes the
current iteration using a barrier. In this implementation every processor increments a counter
upon finishing the current iteration. While checking the counter it can determine whether
it is the last to finish. If it is, it changes a special hardware-managed variable to inform the
other processors that the iteration is finished. If it is not, it waits for a signal from the last
processor. For other possible implementations see [8] and [9].
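
The counter-based barrier described above can be sketched with Python threads. This is an illustration only: the paper's implementation relies on a hardware-managed variable on the Sequent, whereas this sketch substitutes a `threading.Condition` and a generation number, and the names `CounterBarrier` and `run_iterations` are hypothetical.

```python
import threading


class CounterBarrier:
    """Barrier in which the last arriving thread releases all the others."""

    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.generation = 0          # stands in for the hardware-managed flag
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            gen = self.generation
            self.count += 1
            if self.count == self.parties:
                # Last to finish: reset the counter and signal the others.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                # Not the last: wait for the signal from the last thread.
                while gen == self.generation:
                    self.cond.wait()


def run_iterations(parties=4, iterations=50):
    """Run `iterations` barrier-separated steps and log the order of the work."""
    barrier = CounterBarrier(parties)
    log, log_lock = [], threading.Lock()

    def worker():
        for it in range(iterations):
            with log_lock:
                log.append(it)       # the "work" of iteration `it`
            barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because no thread can start iteration k+1 before every thread has finished iteration k, the returned log is non-decreasing; the time the faster threads spend inside `wait` is the synchronization penalty that the asynchronous schemes avoid.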

Note also that our task was simplified since each processor updates only its own specific
values, although it may read some other values. Therefore, there is no requirement for
simultaneous access to update global data. A special mechanism known as locking can be
used in order to prevent such access in cases where the algorithm permits it.

The following parameters were considered:

•  $\Delta x$, $\Delta y$ - represent the discretization lengths along the x-axis and the y-axis,
respectively.

•  $\Delta t$ - represents the discretization length along the t-axis.

•  $T$ - The time level to be reached for the given problem. We seek the solutions for
$u(x,t)$ where $t \le T = N \Delta t$.

•  $p$ - The number of processing elements used for the calculation.

•  $L$ - The number of grid points in the considered domain.

•  $\tau$ - The clock time elapsed before a specific algorithm completed the calculation of the
given problem.

•  $e$ - The maximal error magnitude.

•  $E$ - The maximal absolute relative error of the numerical result (relative to the analytic
solution).

•  scheme - Three types of parallel schemes were used: Eq. (2) (SY - Synchronous), Eq.
(5) (AS - Asynchronous), Eq. (16) (CA - Corrected Asynchronous), and the serial (SR)
version of Eq. (2).

We examined both a one-dimensional problem and a two-dimensional problem.

The one-dimensional problem considered was

$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}$

with the initial condition

$u(x,0) = \sin(\pi x)$

and the Dirichlet boundary conditions

$u(0,t) = u(1,t) = 0$.

The calculated results were compared with the analytic result given by

$u(x,t) = \sin(\pi x) \, e^{-\pi^2 t}$
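
To illustrate how such a comparison can be made, the sketch below advances the serial explicit Euler scheme for this problem and returns the maximal absolute deviation from the analytic result. It is a hedged example rather than the authors' code: the function name `max_error_1d`, the grid with p interior points and spacing 1/(p+1), and the default values of `r` and `t_end` are assumptions.

```python
import math


def max_error_1d(p=20, r=0.4, t_end=0.1):
    """Maximal error of the serial Euler scheme against sin(pi x) exp(-pi^2 t)."""
    dx = 1.0 / (p + 1)               # assumed grid: p interior points on (0, 1)
    dt = r * dx * dx                 # r = dt / dx^2, since the coefficient is 1
    steps = int(round(t_end / dt))
    x = [i * dx for i in range(p + 2)]
    v = [math.sin(math.pi * xi) for xi in x]
    v[0] = v[-1] = 0.0               # Dirichlet boundary conditions
    for _ in range(steps):
        interior = [v[i] + r * (v[i - 1] - 2.0 * v[i] + v[i + 1])
                    for i in range(1, p + 1)]
        v = [0.0] + interior + [0.0]
    t = steps * dt                   # time actually reached by the scheme
    exact = [math.sin(math.pi * xi) * math.exp(-math.pi ** 2 * t) for xi in x]
    return max(abs(a - b) for a, b in zip(v, exact))
```

Refining the grid at fixed r reduces the error, in line with the truncation error of the scheme.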

For the two-dimensional problem we examined the following non-homogeneous problem

$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + F$

where

$F = 5 \pi^2 \sin(\pi x) \sin(2 \pi y)$

with the initial condition

$u(x,y,0) = 0$

and the boundary conditions:

$u(0,y,t) = 0$,   $u(1,y,t) = 0$,   $u(x,0,t) = 0$,   $u(x,1,t) = 0$

The calculated results were compared with the analytic result given, for the steady state, by

$u_\infty = \sin(\pi x) \sin(2 \pi y)$.
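
As a brief check that uses only the expressions stated above, the steady state can be substituted into the spatial operator:

```latex
\begin{aligned}
\frac{\partial^2 u_\infty}{\partial x^2} + \frac{\partial^2 u_\infty}{\partial y^2}
  &= -\pi^2 \sin(\pi x)\sin(2\pi y) - 4\pi^2 \sin(\pi x)\sin(2\pi y) \\
  &= -5\pi^2 \sin(\pi x)\sin(2\pi y) = -F .
\end{aligned}
```

The right-hand side of the equation therefore vanishes for this function, so its time derivative is zero, and it also satisfies the four homogeneous boundary conditions.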
         T = 2 (e = 10^-[illegible])    T = 3 (e = 10^-[illegible])    T = 4 (e = 10^-10)
   p     SY      AS      CA             SY      AS      CA             SY      AS      CA
  10     0.08    0.1     0.09           0.11    0.15    0.13           0.13    0.19    0.15
  11     0.1     0.12    0.11           0.12    0.16    0.14           0.14    0.2     0.16
  12     0.09    0.12    0.1            0.11    0.18    0.15           0.13    0.21    0.17
  13     0.09    0.12    0.11           0.12    0.18    0.15           0.15    0.22    0.18
  14     0.09    0.13    0.12           0.12    0.19    0.16           0.15    0.23    0.18
  15     0.1     0.14    0.12           0.12    0.21    0.16           0.15    0.26    0.19
  16     0.1     0.15    0.12           0.12    0.2     0.16           0.12    0.25    0.2

Table 1: Efficiency for the steady-state one-dimensional problem with one grid point per
processor; L = p, r = 0.5

We also considered the homogeneous non-steady-state problem

$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}$

with the initial condition

$u(x,y,0) = \cos(\pi x) \cos(\pi y)$

and the following Dirichlet boundary conditions:

$u(0,y,t) = \cos(\pi y) \, e^{-2 \pi^2 t}$
$u(1,y,t) = -\cos(\pi y) \, e^{-2 \pi^2 t}$
$u(x,0,t) = \cos(\pi x) \, e^{-2 \pi^2 t}$
$u(x,1,t) = -\cos(\pi x) \, e^{-2 \pi^2 t}$

The calculated results were compared with the analytic solution given by

$u(x,y,t) = \cos(\pi x) \cos(\pi y) \, e^{-2 \pi^2 t}$

Performance results are shown in Table 1 for the one-dimensional problem with one grid
point per processor, and in Tables 2 and 3 for the one-dimensional problem with several
points per processor, all for the steady-state problem. In Table 4 we present results for
the two-dimensional non-homogeneous steady-state problem. Non-steady-state results are
shown in Table 5 for the one-dimensional problem with several points per processor, and in
Table 6 for the two-dimensional problem.

Table 1 presents the efficiency, which is defined by:

Efficiency $= \frac{\tau_{SR}(1, L)}{p \cdot \tau_{scheme}(p, L)}$
         T = 2 (e = 10^-[illegible])    T = 3 (e = 10^-[illegible])    T = 4 (e = 10^-10)
   p     SY      AS      CA             SY      AS      CA             SY      AS      CA
   2     0.78    0.96    0.53           0.79    0.96    0.53           0.79    0.96    0.53
   3     0.76    0.95    0.52           0.77    0.96    0.52           0.77    0.96    0.52
   4     0.74    0.95    0.52           0.75    0.96    0.52           0.74    0.96    0.52
   6     0.66    0.91    0.51           0.68    0.93    0.51           0.66    0.93    0.51
   8     0.6     0.87    0.49           0.62    0.89    0.5            0.62    0.9     0.5
  12     0.49    0.77    0.49           0.5     0.8     0.49           0.5     0.82    0.51
  16     0.37    0.67    0.41           0.4     0.73    0.44           [missing] 0.75  0.44

Table 2: Efficiency for the steady-state one-dimensional problem with several points per
processor; L = 48, $\Delta x$ = 0.02128, r = 0.5

where $T_{SR}(1,L)$ is the run time of the serial version, solving a problem of L grid
points on a single processor, and $T_{scheme}(p,L)$ is the run time of one of the above
parallel schemes, solving the same problem with p processors.
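The efficiency measure can be computed directly from measured run times. The short Python sketch below assumes the usual form, serial run time divided by p times the parallel run time; the function name and the timings are hypothetical and serve only as an illustration.

```python
def efficiency(t_serial, t_parallel, p):
    """Efficiency = T_SR(1, L) / (p * T_scheme(p, L)) for a fixed problem size L."""
    if p < 1:
        raise ValueError("p must be a positive processor count")
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / (p * t_parallel)


if __name__ == "__main__":
    # Hypothetical timings in seconds, not measurements from the paper.
    print(efficiency(t_serial=12.0, t_parallel=2.0, p=8))  # 0.75
```

A value of 1 means perfect speed-up; values above 1, which appear in some of the two dimensional results, mean the parallel run used less total processor time than the serial run.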

We observe from Table 1 that with more than 10 processors the efficiency is only 10% -
25%. We emphasize, however, that much better performance is obtained when several grid
points are assigned to each processor, as shown in Table 2. In this case, the efficiency of the
AS scheme reaches about 90% and more for a small number of processors, while the efficiency
of the CA scheme is about 50%. As the number of processors increases, the efficiency decreases
slightly for the CA and more significantly for the AS. For a large number of processors, the
efficiency of the CA is slightly higher than that of the SY. In most cases, as T increases, the
efficiency of all the parallel algorithms improves slightly. In any case, the AS is the most
efficient scheme for computing a fixed number of time steps. The synchronization penalty is
indicated clearly by Table 3, which shows the run time of the AS and CA schemes relative to
the SY run time. As expected, the AS scheme proves to be the fastest scheme. In most cases it
is almost twice as fast as the SY scheme. The CA scheme is faster than the SY only for a
large number of processors, and even then, as mentioned above, its efficiency is about
the same as that of the SY. In general, as the number of processors increases, there are
more independent entities to synchronize, and there is a better chance for one to delay the
rest. Yet the effect of this increase is not dramatic if all processors are identical in hardware
and are synchronized at every iteration, so that they execute more or less the same
number of iterations. Note that the overall system load has a direct impact on the execution
of sophisticated system tasks, such as synchronization services, in multi-user environments
such as the Sequent system. Thus, we may see significant delays when the system is loaded
with other tasks. This increases the run time of the SY scheme as the number of processors
increases, and thereby decreases the relative run time of the AS and CA, as observed in Table 3.
However, these delays also cause a loss of accuracy in the AS and CA schemes.
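The remark that more processors give a better chance for one to delay the rest can be illustrated with a toy model. The Python sketch below is not the paper's experiment: it assumes that every processor needs 1 time unit per iteration plus a uniformly distributed random delay (a hypothetical stand-in for system load), and it ignores the cost of the synchronization primitive itself. With a barrier, each iteration costs as much as the slowest processor; without one, each processor proceeds at its own average pace.

```python
import random


def mean_step_times(p, steps=2000, jitter=0.2, seed=0):
    """Return (barrier, free) mean time per iteration for p processors.

    Each processor's iteration takes 1 + U(0, jitter) time units
    (a hypothetical load model, not measured data).
    """
    rng = random.Random(seed)
    barrier_total = 0.0
    own_totals = [0.0] * p
    for _ in range(steps):
        times = [1.0 + rng.uniform(0.0, jitter) for _ in range(p)]
        barrier_total += max(times)          # everyone waits for the slowest one
        for i, t in enumerate(times):
            own_totals[i] += t               # no waiting: each keeps its own clock
    # Without a barrier the run ends when the slowest processor finishes.
    return barrier_total / steps, max(own_totals) / steps


if __name__ == "__main__":
    for p in (2, 4, 8, 16):
        sync, free = mean_step_times(p)
        print(p, round(sync, 3), round(free, 3))
```

In this model the barrier cost per iteration grows with p, because the maximum of more random delays is larger on average, while the unsynchronized pace stays near the mean delay.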


        T = 2            T = 3            T = 4
 p      AS     CA        AS     CA        AS     CA
 2      0.82   1.5
 3      0.8    1.46      0.8    1.46
 4      0.77   1.43
 6      0.72   1.31      0.71   1.29
 8      0.69   1.23      0.69   1.24
12      0.63   0.99             0.99
16      0.55   0.89             0.9

Table 3: Run time relative to SY for the steady-state one dimensional problem with L/p points
per processor; all parameters are as in Table 2
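The relative run times can be cross-checked against the efficiencies of Table 2. Assuming that efficiency is the serial run time divided by p times the parallel run time, the serial time and p cancel in the ratio of two efficiencies, which gives the following worked check for p = 2 and T = 2:

```latex
\[
\frac{T_{AS}(p,L)}{T_{SY}(p,L)}
  = \frac{\mathrm{Efficiency}_{SY}}{\mathrm{Efficiency}_{AS}}
  = \frac{0.78}{0.96} \approx 0.81,
\qquad
\frac{T_{CA}(p,L)}{T_{SY}(p,L)}
  = \frac{\mathrm{Efficiency}_{SY}}{\mathrm{Efficiency}_{CA}}
  = \frac{0.78}{0.53} \approx 1.47 .
\]
```

These values agree closely with the tabulated relative run times 0.82 and 1.5 for p = 2; the small differences are consistent with the two-digit rounding of the efficiencies.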

Results for the non-homogeneous two-dimensional problem, shown in Table 4, indicate
similar performance. In this case, we have a non-trivial steady-state. The iterations are
continued until the convergence measure given in the caption of Table 4 drops below
$10^{-2}$. Again, the AS is much faster than the SY, and the CA is at best as fast as the SY.


                     Efficiency               Rel. Run Time
 p   pts in blk    SY     AS      CA        AS     CA
 2   8 x 16        0.66   1.0     0.60      0.59   1.11
 4   4 x 16        0.38   0.645   0.35      0.58   1.09
 8   2 x 16        0.22   0.38    0.205     0.56   1.07
16   1 x 16        0.11   0.20    0.10      0.55   1.03

Table  4:  Efficiency  and  run  time  relative  to  SY  for  the  2-dim.  steady-state  non-homogeneous 
problem;  L  —  256,  Ax  =  Ay  =  0.0667,  -f-  =  0.5;  Calculated  until  yHilhiZElL 

<  10-2 

The efficiency and the accuracy of intermediate (non steady-state) values are shown
in Table 5. In general, for relatively small times T, the asynchronous scheme provides
the least accurate results. Therefore, the asynchronous scheme should be applied only to
steady-state problems. Note that, unlike synchronous iterations, which converge steadily,
asynchronous iterations may not converge monotonically, which affects the stopping
criteria. The efficiency of these schemes, however, decreases as the number of processors
increases: the overhead of initiating another processor then becomes more significant than
the work it actually performs. The CA scheme is a compromise between the two. It improves the
                       E                           e
 p    Efficiency     SY        AS     CA         SY         AS      CA
 2                   0.00367   0.9
 3                   0.00367
      0.47                            0.05       0.000026   0.007   0.00025
 6                                    0.05       0.000026
 8    0.5            0.00367   0.9    0.04       0.000026   0.007   0.00013
12    0.56                     0.9    0.07       0.000026   0.007   0.00018
16                             0.9                                  0.00012

Table 5: Efficiency and accuracy for the one dimensional problem with L/p grid points
per  processor;  L  =  48,  Ax  =  0.02128,  —  0.5,7’  =  0.5 

accuracy lost in the asynchronous version by doing some extra extrapolation calculations,
at the cost of increased calculation time. Note that although the CA is significantly more
accurate than the AS, it is still less accurate than the SY scheme, because of the lower order of
its truncation error ($O(\Delta x_i)$ instead of $O(\Delta x_i^2)$).
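The practical meaning of a first-order versus a second-order truncation error shows up in how the error responds to grid refinement. The small Python helper below estimates the observed order from the errors on two grids; the function name and the error values are hypothetical and chosen only to illustrate the two cases.

```python
import math


def observed_order(err_coarse, err_fine, refinement=2.0):
    """Estimate q in err ~ C * dx**q from errors on two grids.

    refinement is the ratio dx_coarse / dx_fine.
    """
    if err_coarse <= 0 or err_fine <= 0 or refinement <= 1:
        raise ValueError("errors must be positive and refinement must exceed 1")
    return math.log(err_coarse / err_fine) / math.log(refinement)


if __name__ == "__main__":
    # Hypothetical errors: halving dx halves a first-order error
    # and quarters a second-order error.
    print(observed_order(0.04, 0.02))
    print(observed_order(0.04, 0.01))
```

Under this error model, a first-order scheme needs a grid about 100 times finer to gain two digits of accuracy, whereas a second-order scheme needs one only about 10 times finer.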

Two dimensional results are shown in Table 6. In this case, the efficiency of all schemes
proves to be much better than in the one dimensional case (in some cases it is close to, and
even higher than, one). The run time improvement of the AS and CA schemes relative to
the SY is significant when communication is significant relative to computation. With a
small number of processors, the synchronization penalty is small in comparison to the extra
computation needed by the CA scheme (as compared to the AS and SY schemes). However,
as the number of processors increases and there is less work per processor, the run time of
the CA approaches that of the AS.

Although  the  intermediate  time-level  results  of  the  CA  scheme  are  significantly  more 
accurate  than  those  of  the  AS,  they  are  still  much  less  accurate  than  those  of  the  SY  scheme. 
This  is  due  to  the  dimensional  additivity  of  the  truncation  error  which,  as  indicated  earlier,  is 
of  lower  order  than  that  of  the  synchronous  scheme.  Hence,  in  multi-dimensional  problems, 
one  should  either  use  a  very  fine  grid,  or  else  higher  order  schemes. 




                     Efficiency             E                      e
 p   pts in blk    SY     AS     CA      SY     AS     CA      SY          AS         CA
 2   8 x 16        0.95   1.05   0.51    0.07   0.99   0.17    0.0000036   0.000051   0.0000030
 4   4 x 16        0.79   0.90   0.45    0.07   0.99   0.5     0.0000036   0.000051   0.000018
 8   2 x 16        0.67   0.81   0.42    0.07   0.97   0.16    0.0000036   0.000049   0.0000028
16   1 x 16        0.45   0.60   0.37    0.07   0.99   0.17    0.0000036   0.000051   0.0000030

Table 6: Efficiency and accuracy for the 2-dim. problem; L = 256, $\Delta x = \Delta y = 0.0667$, +


6  Summary 


This paper presents asynchronous (AS) and corrected-asynchronous (CA) finite difference
schemes (based on the Euler explicit scheme) for the multi-dimensional heat equation,
to be implemented on MIMD multiprocessors. Although we consider only the heat equation,
our analysis can be easily modified and extended to other parabolic partial differential
equations and to other finite difference schemes. Our schemes are analyzed and implemented
on the shared-memory multi-user Sequent Balance machine. They are compared with the
corresponding serial (SR) scheme and the parallel synchronous (SY) scheme. In general, the
efficiency of the parallel schemes increases as more mesh points are assigned to each
processor, as the time-level increases, and as the problem dimension increases. It is proved that
the AS scheme converges to the solution of a differential equation other than the original
one. However, it provides accurate steady-state results, and its efficiency may reach 90% and
over. The AS may be almost twice as fast as the parallel synchronous (SY) scheme. It is proved
that, unlike the AS scheme, the CA scheme does converge to the solution of the original heat
equation, under certain requirements. Its efficiency, however, is only about 50% in the best
cases considered here. Nevertheless, the CA offers more accurate results for the intermediate
(non steady-state) time-levels. Yet its accuracy is lower than that of the SY scheme, especially
in the multi-dimensional case, due to the loss of order in the truncation error in each spatial
coordinate. Hence, for intermediate time-level results one should either use a very fine spatial
net or use higher order schemes.
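The claim that unsynchronized updates still reach the correct steady state can be illustrated with a small serial emulation. The Python sketch below is not the AS scheme analyzed in this paper: it merely mimics missing synchronization by letting each block of grid points skip iterations at random, so that neighbouring blocks see values from different time levels. It assumes the one dimensional problem u_t = u_xx with u(0) = 0 and u(1) = 1, whose steady state is u(x) = x, and a ratio dt/dx^2 = 0.5, for which the Euler update reduces to the average of the two neighbours. All names and parameter values are hypothetical.

```python
import random


def async_steady_state(n=17, blocks=4, sweeps=6000, skip_prob=0.5, seed=1):
    """Emulate unsynchronized block updates; return max deviation from u(x) = x."""
    rng = random.Random(seed)
    u = [0.0] * n
    u[-1] = 1.0                                   # Dirichlet values u(0) = 0, u(1) = 1
    interior = list(range(1, n - 1))
    size = -(-len(interior) // blocks)            # ceiling division
    chunks = [interior[k:k + size] for k in range(0, len(interior), size)]
    for _ in range(sweeps):
        for chunk in chunks:
            if rng.random() < skip_prob:
                continue                          # this "processor" lags behind
            old = u[:]                            # values visible when the block starts
            for i in chunk:
                # Euler step with dt/dx^2 = 0.5: average of the two neighbours.
                u[i] = 0.5 * (old[i - 1] + old[i + 1])
    return max(abs(u[i] - i / (n - 1)) for i in range(n))


if __name__ == "__main__":
    print("max deviation from steady state:", async_steady_state())
```

Because the blocks advance at different rates, intermediate values in this emulation do not correspond to any single time level, yet the fixed point of the update is unchanged, so the final values still approach the steady state.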